More Polars plan optimizations for TPC-DS#22395
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Summary by CodeRabbit:

Refactors many polars_impl TPC-DS benchmark queries to apply early filtering and column projection, use semi-joins and derived pair sets, and reorder joins/aggregations while preserving final result expressions.

Changes: Polars query optimization pattern
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py`:
- Line 121: The ranking uses Polars' ordinal method which gives unique ranks and
diverges from SQL RANK() semantics; update both occurrences where
pl.col("avg_profit").rank(method="ordinal").alias("rnk") is used (and any
subsequent filters like rnk < 11) to use method="min" instead so tied avg_profit
values receive the same rank with gaps (matching DuckDB/SQL RANK behavior).
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py`:
- Line 177: The aggregation pl.col("ext_sales_price").sum().alias("sales_amt")
returns 0 for all-null groups in Polars, which differs from SQL/DuckDB; replace
this with a conditional aggregation that checks
pl.col("ext_sales_price").count() > 0 and only returns the sum when count>0,
otherwise returns None, mirroring the null-sum handling used in q1/q49 so that
the "sales_amt" column matches SQL semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4b334632-2668-4f86-af10-2ff2b1dba11b
📒 Files selected for processing (19)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
e94077b to e1bb3be
🧹 Nitpick comments (2)
python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py (1)

121-121: ⚡ Quick win: Deduplicate the semi-join probe keys.

Same pattern here: sr_customer_item is acting as an existence set for two semi joins, so keeping duplicate (sr_customer_sk, sr_item_sk) rows only makes the join builds heavier. A .unique() should make the prefilter cheaper without changing semantics.

♻️ Proposed change

```diff
-sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+sr_customer_item = store_returns_filtered.select(
+    ["sr_customer_sk", "sr_item_sk"]
+).unique()
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py` at line 121, sr_customer_item currently keeps duplicate (sr_customer_sk, sr_item_sk) rows used as an existence set for two semi-joins; make the prefilter cheaper by deduplicating it. Update the assignment for sr_customer_item (from store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])) to call .unique() (or the equivalent drop_duplicates()) on the resulting frame so duplicates are removed before using sr_customer_item in the semi-joins.

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py (1)
126-126: ⚡ Quick win: Deduplicate the semi-join probe keys.

sr_customer_item is only used as the RHS of how="semi" joins, so duplicate (sr_customer_sk, sr_item_sk) pairs cannot change results but can still bloat both downstream join builds. Adding .unique() here should reduce the work in the hottest part of this rewrite.

♻️ Proposed change

```diff
-sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+sr_customer_item = store_returns_filtered.select(
+    ["sr_customer_sk", "sr_item_sk"]
+).unique()
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py` at line 126, sr_customer_item currently selects ["sr_customer_sk", "sr_item_sk"] from store_returns_filtered but is only used as the RHS of semi-joins (how="semi"), so duplicate (sr_customer_sk, sr_item_sk) pairs only bloat downstream joins; change the creation to deduplicate the probe keys by applying .unique() to the selection (i.e., replace the current store_returns_filtered.select([...]) usage for sr_customer_item with store_returns_filtered.select([...]).unique()) so the semi-join builds operate on distinct keys.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 800a9d03-24c5-465a-8998-3ee5af1eaa20
📒 Files selected for processing (19)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
🚧 Files skipped from review as they are similar to previous changes (17)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
/ok to test f1a800d
Description
Hand-tune polars_impl for 19 TPC-DS benchmark queries in python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/. Each rewrite preserves query semantics and only changes how the polars LazyFrame is constructed; duckdb_impl is unchanged.

The optimizations apply a small set of recurring patterns that the polars optimizer does not (yet) perform automatically:
- Filter date_dim, item, store, etc. by literal predicates (year, quarter, month window, category/class/brand) before any join, so the join builds smaller hash tables.
- Use derived pair sets (e.g. store_returns (customer, item) pairs) as semi-join probes against the fact tables, shrinking them before the expensive joins.
- select(...) only the columns each table contributes before joining, instead of relying on the planner to prune them later.

Test plan
Checklist